Exploring Red Wine Quality

Introduction

This project explores and analyzes red wine quality. The underlying objective is to understand which chemical properties influence the quality of red wines. The exploratory data analysis is carried out in R; the dataset can be found here and additional literature on the variables can be found here.

Univariate Plots Section

The following are some basic statistics on the dataset and the quality variable.

# Summary Statistics
str(wq)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
summary(wq)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
summary(wq$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

The dataset contains 1,599 wine observations across 13 numeric variables. X appears to be a unique row identifier, and quality is the primary output. Quality is scored on a 10-point scale and was rated by at least three wine experts. Interestingly, the observed scores range only from 3 to 8, with a mean of 5.6 and a median of 6. Because quality takes only integer values, it is an ordinal, discrete variable.

table(wq$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The following are histograms and boxplots for the 12 variables to kick off the data visualizations.

Looking at the histograms for all the features, density and pH appear approximately normally distributed, as does quality. These may be interesting relationships and are explored further in subsequent sections.

Most of the other distributions are right-skewed, with the bulk of the observations on the left and a tail extending to the right. Citric acid, however, has a concerningly high number of zero values. Residual sugar and chlorides have particularly long tails. Let’s see how these trends compare on boxplots next.

These boxplots confirmed many of the trends picked up in the histograms. The approximately normal distribution of density and pH can be observed here as well. Likewise, residual sugar and chlorides have a lot of outliers. The distribution of citric acid is fairly odd. Perhaps subsetting out the zero values might help.

The histogram is slightly better, but the boxplot for citric acid doesn’t seem to have changed much. The zeros could reflect unreported or missing data.
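As a rough illustration of the subsetting step, the sketch below counts exact zeros separately from true missing values (NA) and keeps only the positive entries. It uses just the first ten citric.acid values visible in the str(wq) output, so the counts are illustrative only; the variable names are hypothetical and not taken from the original analysis.

```r
# First ten citric.acid values, copied from the str(wq) output above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Count exact zeros and true missing values (NA) separately.
n_zero <- sum(citric == 0)
n_na   <- sum(is.na(citric))

# Keep only strictly positive values, e.g. before a log10 transform.
citric_pos <- citric[citric > 0]

print(c(zeros = n_zero, missing = n_na, kept = length(citric_pos)))
```

On the full dataset the same three lines would be applied to wq$citric.acid.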

Univariate Analysis

What is the structure of your dataset?

There are 1,599 wine observations across 13 numeric variables where X is the unique identifier and fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality are the 12 features.

The first 11 variables are physicochemical data points on wine samples, and quality is a 10-point scale output based on sensory data from at least three wine experts.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. From the Univariate Plots Section, it can be observed that quality follows a near-normal distribution where the bulk of the observations are in the 5-6 range with some outliers on either end. This can be further outlined by using a coarser rating variable, such that a quality score of 0-4 denotes a Poor wine, a score of 5-6 denotes an Average wine, and a score of 7+ denotes a Good wine.

##    Poor Average    Good 
##      63    1319     217
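One way such a rating variable could be built is with base R's cut(). The sketch below is not necessarily how the original chunk was written; it rebuilds the quality vector from the table(wq$quality) counts shown earlier so that it runs without the dataset.

```r
# Rebuild the quality vector from the table(wq$quality) counts shown earlier.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Bin the scores: (0, 4] -> Poor, (4, 6] -> Average, (6, 10] -> Good.
# Note: a score of exactly 0 would become NA with these breaks; none occur here.
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("Poor", "Average", "Good"),
              ordered_result = TRUE)

print(table(rating))
```

Setting ordered_result = TRUE keeps the Poor < Average < Good ordering, which is convenient when the factor is later used for faceting or for an ordinal correlation.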

Throughout this exploratory data analysis, the drivers of quality will be unearthed and examined.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Similar to quality, density and pH seem to be approximately normally distributed. Fixed and volatile acidity, free and total sulphur dioxide, sulphates, and alcohol seem to be right-skewed and long-tailed. It is unclear which features directly affect quality, but from some high-level research, it appears that alcohol content, acidity and pH might be contributors.
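The direction of skew can be sanity-checked from the summary(wq) output alone by comparing each mean with its median. This is only a heuristic and not a formal skewness test; the sketch below uses a hand-picked subset of variables with values copied from that output.

```r
# Means and medians copied from the summary(wq) output above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "total.sulfur.dioxide",
               "sulphates", "density", "pH"),
  mean     = c(2.539, 0.08747, 46.47, 0.6581, 0.9967, 3.311),
  median   = c(2.200, 0.07900, 38.00, 0.6200, 0.9968, 3.310)
)

# Relative gap between mean and median; a clearly positive gap hints at
# a long right tail, while a gap near zero is consistent with symmetry.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

print(stats[order(-stats$rel_gap), ])
```

Density and pH show gaps very close to zero, while residual sugar and total sulphur dioxide show the largest positive gaps.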

Further research failed to highlight any difference in benefit between the different types of acidity in wine. Thus, for the purpose of this project, fixed acid (tartaric acid), volatile acid (acetic acid) and citric acid were combined into a variable named acidity. It should also be noted that the presence of sulphur dioxide and sulphates may indicate the presence of sulphuric acid; this is ignored as being beyond the scope of this project.

Did you create any new variables from existing variables in the dataset?

A new variable, rating, was defined that categorizes the wine quality scores into Poor, Average, and Good buckets to illustrate their near-normal distribution. Lastly, a key variable, acidity, was defined as the sum of fixed acidity, volatile acidity and citric acid. It is hypothesized that acidity is a driver of wine quality.
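The combined variable is a plain row-wise sum. A minimal sketch, using only the first ten rows shown in the str(wq) output (the data frame name wq_head is hypothetical):

```r
# First ten rows of the three acid columns, copied from the str(wq) output.
wq_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# Combined acidity as the row-wise sum of the three acid measurements.
wq_head$acidity <- with(wq_head,
                        fixed.acidity + volatile.acidity + citric.acid)

print(wq_head$acidity)
```

Because fixed acidity is roughly an order of magnitude larger than the other two components, the sum is dominated by it.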

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution of citric acid is fairly unusual, given that the distributions of fixed acidity and volatile acidity on a logarithmic scale resemble the near-normal distribution of pH. Citric acid has a large number of zero values, which could reflect incomplete or unavailable data.

The dataset in general was fairly tidy such that additional wrangling was not needed.

Bivariate Plots Section

The bivariate plots began with a scatterplot matrix. Unfortunately, due to the large file size, generating such a plot took much too long. Instead, a sample of the dataset was used to begin the exploration. Still, the plot was just too messy to be of much use.

[Figure: scatterplot matrices]

The scatterplot matrix knitr chunk was almost suppressed, as the gigantic plot was too unwieldy to yield meaningful insights. Nevertheless, the boxplots by rating and some of the correlations seemed noteworthy and were subsequently explored.

These boxplots provided some very interesting insights. It appears that fixed acidity, citric acid, sulphates and alcohol are positively correlated with wine quality, while volatile acidity and pH are negatively correlated. The difference in behavior of the acids does bring into question the decision to use a combined acidity variable, but a better assessment will be made in a subsequent section.

Lastly, it seems that density doesn’t play a significant part in wine quality. Its near-normal distribution in the univariate section had made it a feature of interest. Perhaps the correlation values might be more kind?

##                    X        fixed.acidity     volatile.acidity 
##           0.06645261           0.12405165          -0.39055778 
##          citric.acid       residual.sugar            chlorides 
##           0.22637251           0.01373164          -0.12890656 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          -0.05065606          -0.18510029          -0.17491923 
##                   pH            sulphates              alcohol 
##          -0.05773139           0.25139708           0.47616632 
##              quality               rating              acidity 
##           1.00000000           0.81236704           0.10375373
##                    X        fixed.acidity     volatile.acidity 
##           0.11527163           0.11423756          -0.39124918 
##          citric.acid       residual.sugar            chlorides 
##                  NaN           0.02353331          -0.17613996 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          -0.05008749          -0.17014272          -0.17517368 
##                   pH            sulphates              alcohol 
##          -0.05757386           0.30864193           0.47698109 
##              quality               rating              acidity 
##           0.97556915           0.79200148           0.09282597

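Correlation coefficients with quality were computed on both a plain and a log10 scale (the first and second blocks above, respectively). As expected, citric acid, alcohol and, to a lesser extent, fixed acidity had a positive correlation with quality, while volatile acidity had a negative one. Interestingly, sulphates appeared to have a stronger correlation on the logarithmic scale (0.309 versus 0.251), and pH seemed to be hardly correlated. The NaN for citric acid on the log10 scale arises because log10(0) is -Inf for the zero-valued observations.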

A couple more interesting insights were:
- the extremely low correlation of the combined acidity variable with quality, at 0.104. This proved to be somewhat of a dead end, unfortunately.
- density has a modest negative correlation of -0.175. This isn’t the strongest, but enough to still be of interest.
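The sketch below shows one way the plain and log10 coefficients could be computed, and reproduces the NaN seen for citric acid. It is an illustration under stated assumptions, not the original chunk: it uses only the first ten rows from the str(wq) output and two features, so the coefficients will not match the full-data values above.

```r
# First ten rows of three columns, copied from the str(wq) output above.
wq_head <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality     = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

features <- c("citric.acid", "alcohol")

# Pearson correlation of each feature with quality on the plain scale.
plain <- sapply(features, function(f) cor(wq_head[[f]], wq_head$quality))

# Same on the log10 scale; log10(0) is -Inf, so citric.acid yields NaN.
logged <- sapply(features,
                 function(f) cor(log10(wq_head[[f]]), wq_head$quality))

# One possible workaround: drop the zero rows before taking the logarithm.
pos <- wq_head$citric.acid > 0
logged_pos <- cor(log10(wq_head$citric.acid[pos]), wq_head$quality[pos])

print(rbind(plain, logged))
print(logged_pos)
```

Dropping the zero rows changes the sample being described, so a coefficient obtained that way is not directly comparable with the plain-scale one.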

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From the boxplots, it appears that fixed acidity, citric acid, sulphates and alcohol are positively correlated with wine quality, while volatile acidity and pH are negatively correlated. The correlation coefficients showed similar trends, with the exception of pH, which had a coefficient of only about -0.058, and sulphates, which had a stronger coefficient of 0.309 on the log10 scale.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The acidity and sulphur dioxide relationships were examined.

There seems to be a trend between fixed acidity and citric acid, and between volatile acidity and citric acid, but oddly there seems to be no relationship between fixed acidity and volatile acidity. This could be because the underlying chemistries are not dependent on each other.

As a purely positive control test, the relationship between acidity on a logarithmic scale and pH was observed.

##        cor 
## -0.7044435

As expected, the higher the acidity, the lower the pH value, with a correlation coefficient of -0.704.

The relationship between free and total sulphur dioxide was investigated.
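The single value labelled cor in the output above is the form returned by the estimate element of base R's cor.test(). A minimal sketch of this control check follows, assuming acidity is the sum of the three acid columns and using only the first ten rows from the str(wq) output, so the coefficient will differ from the full-data value.

```r
# First ten rows, copied from the str(wq) output above.
wq_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Combined acidity, then a Pearson test against pH on the log10 scale.
acidity <- with(wq_head, fixed.acidity + volatile.acidity + citric.acid)
res <- cor.test(log10(acidity), wq_head$pH)

# The estimate element is a length-one vector named "cor".
print(res$estimate)
```

Since pH is itself a negative logarithm of hydrogen ion concentration, taking log10 of acidity puts both variables on comparable scales, and a negative sign is the expected outcome.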

##       cor 
## 0.6676665

A correlation coefficient of 0.668 indicates a fairly strong relationship between the two sulphur dioxide measures. Some research indicates that sulphur dioxide is used as an antimicrobial in wine making and that free sulphur dioxide is one component of the total.

What was the strongest relationship you found?

The strongest relationships to quality, by correlation coefficient, were as follows:
- alcohol: 0.476
- sulphates (log10): 0.309
- citric acid: 0.226
- fixed acidity: 0.124
- volatile acidity: -0.391
- density: -0.175

Multivariate Plots Section

There were six features of interest from the bivariate plots. In this multivariate plot section, they were explored in further detail.

This is a really interesting plot. It suggests that good wines tend to be high in both alcohol and sulphates.

Even with the zero values removed, it is hard to pick out a decent trend.

These two plots were examined as it was believed that alcohol and volatile acidity would have an interesting interplay due to their opposite correlations with quality. The second plot proved to be very telling; it showed a clear separation between poor wine (high volatile acidity and low alcohol content) and good wine (low volatile acidity and high alcohol content).

Density didn’t appear to yield much in terms of trend with alcohol.

Citric acid didn’t yield additional insights in visual trends with fixed acidity and volatile acidity.

Density proved to be a dead end. Because both are negatively correlated with wine quality, density and volatile acidity were expected to be correlated in some way. As seen in the plot, this was not so.

Surprisingly, pH had very little visual impact on wine quality, and was overshadowed by the larger impact of alcohol.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

For the multivariate plots, the features that bore the strongest relationship to quality were examined by coloring the plots by quality score and faceting them by the three rating categories. It can be noted that higher alcohol, sulphates, citric acid, and fixed acidity, and lower volatile acidity, are associated with better wine quality. This is in line with the insights uncovered thus far.

Were there any interesting or surprising interactions between features?

Since alcohol, specifically ethanol, is a very weak acid, it was thought that it might be somewhat correlated with the presence of other acids, such as citric acid. The plot of alcohol against citric acid above clearly shows their lack of correlation.

To close off the discussion around pH, it can be visually observed not to be a driver of wine quality when compared with the very obvious alcohol variable. It should be noted, though, that pH depends on the concentration of acids in wine and doesn’t seem to vary far from the 3-4 range.


Final Plots and Summary

From the numerous plots above, it can be found that acidity, alcohol content and sulphates contribute to good wines. The final plots will illustrate these findings.

Plot One: Acidity on wine quality

It can be noted that not all acids are created equal. These boxplots illustrate that higher fixed acidity (tartaric acid) and citric acid are found in better quality wines. Furthermore, lower volatile acidity (acetic acid) is also associated with higher wine quality. Therefore, a lower pH alone would be a red herring for wine quality. After all, a higher acid concentration leads to a lower pH value, but only tartaric and citric acid seem to benefit wine quality.

Plot Two: What is wine if it can’t get you drunk?

This scatterplot shows a trend of higher wine quality ratings with higher alcohol content and lower volatile acidity. The correlation coefficients indicated that alcohol and volatile acidity were the two features most strongly correlated with quality. The dotted lines represent the mean of each axis, and the bottom right quadrant has a high density of Good wine ratings.
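The quadrant reading can also be expressed numerically by splitting each axis at its mean and cross-tabulating quadrant against rating. The sketch below assumes alcohol is on the x-axis and volatile acidity on the y-axis, which is consistent with the bottom right quadrant being high alcohol and low volatile acidity. The helper quadrant_of is hypothetical, and the ten rows from the str(wq) output are far too few to reproduce the pattern seen in the full plot.

```r
# First ten rows, copied from the str(wq) output above.
wq_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Hypothetical helper: label each point by its position relative to the
# mean of x (left/right) and the mean of y (bottom/top).
quadrant_of <- function(x, y) {
  paste(ifelse(y > mean(y), "top", "bottom"),
        ifelse(x > mean(x), "right", "left"))
}

wq_head$quadrant <- quadrant_of(wq_head$alcohol, wq_head$volatile.acidity)
wq_head$good     <- wq_head$quality >= 7

# Cross-tabulate quadrant membership against the Good rating.
print(table(wq_head$quadrant, wq_head$good))
```

Applied to the full dataset, the share of Good wines in each quadrant would put a number on what the scatterplot shows visually.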

Plot Three: Putting sulphates into perspective with alcohol content

This final plot is perhaps one of the most telling visualizations, as it illustrates that good wines tend to have an abundance of both sulphates and alcohol. The dotted lines represent the mean of each axis, and the top right quadrant has a high density of Good wine ratings.


Reflection

Exploratory data analysis proved to be very effective in understanding relationships within the red wine quality dataset. At the beginning of the analysis, various features were considered of interest, namely density, pH, fixed acidity, volatile acidity, sulphates, and alcohol. The univariate plots were helpful in getting accustomed to the distribution of the features. But it was ultimately the bivariate plots that yielded key insights into where to look more closely. The multivariate plots revealed key trends that were extremely telling; they added a layer of detail over the bivariate plots that was very helpful, and were thus favoured.

There were a few slight struggles and dead ends throughout this project. The scatterplot matrix using ggplot was very cumbersome to plot. This was very likely due to the dozen-plus variables being plotted, and as such it was ineffective in illustrating trends and correlations. Instead, dedicated plots and correlation coefficients were generated against the quality feature. Beyond plotting difficulties, pH, density and the combined acidity variable proved to be dead ends. They were explored at length and with much promise, but ultimately failed to display any meaningful relationship.

It was found that fixed acidity, citric acid, alcohol content and sulphates are positively associated with wine quality, and volatile acidity is negatively associated with it. Boxplots and scatterplots seemed to be the most telling visualizations for this dataset. The final plots depict the relationship of acidity to a good wine and, most importantly, show that such a wine will likely come with high alcohol content, high sulphates and low volatile acidity. The final plots also debunked the notion that pH in general is correlated with wine quality.

It should be noted that wine quality is highly subjective and depends on an individual’s taste; a better study would include the quantities of each wine sold in the market. Further analysis using inferential statistics and similar methodologies should be used to verify the findings in this exploration. Nevertheless, the plots here did uncover an interesting and telling story of wine quality in the available observations.


References